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ABSTRACT 

Detection of differential expression in RNA-Seq data 
is currently limited to studies in which two or more 
sample conditions are known a priori. However, 
these biological conditions are typically unknown 
in cohort, cross-sectional and nonrandomized 
controlled studies such as the HapMap, the 
ENCODE or the 1000 Genomes project. We present 
DEXUS for detecting differential expression in RNA- 
Seq data for which the sample conditions are 
unknown. DEXUS models read counts as a finite 
mixture of negative binomial distributions in which 
each mixture component corresponds to a condition. 
A transcript is considered differentially expressed if 
modeling of its read counts requires more than one 
condition. DEXUS decomposes read count variation 
into variation due to noise and variation due to dif- 
ferential expression. Evidence of differential expres- 
sion is measured by the informative/noninformative 
(l/NI) value, which allows differentially expressed 
transcripts to be extracted at a desired specificity 
(significance level) or sensitivity (power). DEXUS 
performed excellently in identifying differentially ex- 
pressed transcripts in data with unknown conditions. 
On 2400 simulated data sets, l/NI value thresholds of 
0.025, 0.05 and 0.1 yielded average specificities of 92, 
97 and 99% at sensitivities of 76, 61 and 38%, re- 
spectively. On real-world data sets, DEXUS was 
able to detect differentially expressed transcripts 
related to sex, species, tissue, structural variants or 
quantitative trait loci. The DEXUS R package is 
publicly available from Bioconductor and the 
scripts for all experiments are available at http:// 
www.bioinf.jku.at/software/dexus/. 

INTRODUCTION 

The advent of next-generation sequencing has greatly 
expanded our knowledge about transcriptomes. New 
transcripts and splice variants have been found and 



break points of known transcripts determined more 
accurately (1-6). However, in RNA-Seq experiments, 
quantification of the expression of transcripts can be dif- 
ficult (7). Without biological variability, transcripts that 
are differentially expressed between two conditions can be 
detected reliably (8). In studies with biological variability, 
however, detection of differential expression between two 
conditions remains challenging (9). A transcript that is 
differentially expressed between many conditions is hard 
to detect because read count variation due to differential 
expression and due to high overdispersion can only be 
distinguished with many samples and high coverage. See 
Supplementary Section S2 for more details. To detect dif- 
ferentially expressed transcripts, we therefore assume that 
the number of conditions is small compared with the 
number of samples. 

Identifying differential expression is currently limited to 
particular study designs 

Current methods for analyzing RNA-Seq data can 
identify differential expression between two conditions. 
For example, in a case-control study, only transcripts 
that are differentially expressed between cases and 
controls can be identified. Similarly, in a randomized 
controlled study, differential expression between treated 
and untreated subjects can be detected. These study 
designs can be generalized to more case groups or more 
treatments, which leads to multiple (more than two) 
known conditions. For example, multiple conditions 
may be due to different tissue types, as in the 'Allen 
Brain Atlas' (10), the 'Gene Expression Nervous System 
Atlas' (11), and the 'BioGPS' (12). 

Identification of differential expression in RNA-Seq 
data requires a priori known conditions. In cohort, 
cross-sectional and nonrandomized controlled studies, 
the biological conditions are unknown or only partially 
known. Cohort and cross-sectional studies are observa- 
tional studies in which the conditions of the subjects are 
unknown. Examples of observational studies include the 
HapMap (13), ENCODE (6) and the 1000 Genomes (14) 
project, for which RNA-Seq data are available (15,16). 
Nonrandomized controlled studies are treatment studies 
in which conditions such as genetic, environmental 
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or treatment effects are not completely known. In 
nonrandomized controlled studies, unknown genetic vari- 
ations such as single-nucleotide polymorphisms (SNPs), 
copy number variations and unknown environmental factors 
may result in differential expression between treated 
subjects. Furthermore, individual unknown treatment 
effects may cause variation in gene expression, for instance, 
responses of cell lines to the addition of compounds (17). 
Other examples are found in oncology, where unknown 
cancer subtypes or unknown cancer stages are characterized 
by a particular gene expression profile (18,19). 

In nonclinical studies, the conditions are also often 
unknown. During development, the transcriptome regu- 
lates and controls cell growth, differentiation, movement 
and morphogenesis. Genes are differentially expressed 
between different time points and between different 
tissues; even within one tissue, gene expression may vary 
spatially. For two samples taken at different times or from 
different locations it is often unknown whether the condi- 
tions differ. Another example is in vivo or in vitro gene 
expression in mice treated with drug candidates (20,21). 
Unknown factors such as individual responses or side 
effects lead to differentially expressed transcripts 
between the samples. 

The detection of differential expression in RNA-Seq 
studies with unknown conditions is important to obtain 
new biological knowledge. Current RNA-Seq methods, 
however, require the conditions to be known. For micro- 
array data, a method for identifying unknown conditions 
in gene expression has been suggested (22). However, this 
method cannot be applied to RNA-Seq data with 
unknown conditions because a primary modeled factor 
is required and the noise is assumed to be Gaussian, 
which is not appropriate for RNA-Seq count data (23). 
We therefore present DEXUS, a method capable of de- 
tecting differential expression in RNA-Seq studies with 
unknown conditions. 

A summary of study designs and methods that can 
detect differential expression in them is shown in Table 1 . 

Existing methods for detecting differential expression in 
RNA-Seq data 

Methods that detect differential expression in RNA-Seq 
data are usually based on read counts, i.e. the number of 
reads mapping to a DNA region that is transcribed, such 
as a gene or an exon (32). These methods compare read 



counts for two conditions. If read counts show a large and 
consistent difference between the conditions, then the ac- 
cording transcript is differentially expressed. In this sub- 
section, we review methods that detect differential 
expression in RNA-Seq data. Many methods model read 
counts by a negative binomial distribution because even 
after normalization the read counts have high variance. 
Therefore, we divide methods into two classes: those 
which do not use negative binomials (class A) and those 
which do (class B). 

The following methods belong to class A. 

DEGSeq (28) assumes that the log fold change of mean 
read counts between the two conditions follows a normal 
distribution given the log average expression. A differen- 
tially expressed gene is identified by a small P-value by 
means of this distribution. 

NOISeq (30) also considers the log fold change of read 
counts between two given conditions together with their 
absolute difference. Empirical distributions are calculated 
using all pairs of replicates from different conditions. 
NOISeq identifies a gene as differentially expressed if the 
log fold change of read counts and the absolute difference 
of read counts between the two conditions have both a 
small P-value for the empirical distributions. 

SAMSeq (27) performs a Wilcoxon test for each tran- 
script testing the counts of one condition against the 
counts of the other. Because standard normalization tech- 
niques are not applicable, subsampling is used to normal- 
ize the read counts. SAMSeq requires a relatively high 
number of samples per condition to obtain significance 
for differential expression. 

PoissonSeq (29) fits a Poisson log-linear model to the 
read counts after transforming them. A score statistic on 
the model parameters determines the significance for dif- 
ferential expression. 

The following class B methods use negative binomial 
distributions to model the read counts. 

edgeR (25) uses a quantile-adjusted conditional 
maximum likelihood estimator for the overdispersion par- 
ameter of the negative binomial distribution. This estima- 
tor is more accurate than the standard maximum 
likelihood estimator when only few replicates per condi- 
tion are available (33). Borrowing information across 
transcripts allows the dispersion parameter to be 
adjusted toward a consensus value using an empirical 
Bayes procedure (34). Finally, edgeR uses an exact test 



Table 1. An overview of study designs and methods that can detect differential expression in them 



Study design 


DEXUS 


DESeq 


edgeR 


baySeq 


SAMSeq 


DEGSeq 


PoissonSeq 


NOISeq 


DSS 


Two known conditions 




















Case-control study 


/ 


/ 


/ 


/ 


/ 


/ 


/ 


/ 


/ 


Randomized controlled study 


/ 


/ 


/ 


/ 


/ 


/ 


/ 


/ 


/ 


Multiple known conditions 




















Multiple case-control study 


/ 


/ 


/ 


/ 


/ 










Multiple treatment RCS 


/ 


/ 


/ 


/ 


/ 










Unknown conditions 




















Cross-sectional study 


/ 


















Cohort study 


/ 


















Nonrandomized controlled study 


/ 



















Alongside DEXUS, we included DESeq (24), edgeR (25), baySeq (26). SAMSeq (27), DEGSeq (28), PoissonSeq (29), NOISeq (30) and DSS (31). 
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to determine whether the counts of the two conditions 
come from the same negative binomial distribution. 

DESeq (24) pools together transcripts with similar 
expressions values to improve the estimate of the 
overdispersion parameter. The overdispersion is assumed 
to be a function of the mean read count and is therefore 
estimated per condition. To determine whether a tran- 
script is differentially expressed, the distribution param- 
eters of the two conditions are tested by an exact test for 
equality of means. 

baySeq (26) determines the distribution of the 
overdispersion parameter by applying a quasi-likelihood 
method to the read counts of one condition. The resulting 
distribution is used as prior for estimating the 
overdispersion parameter when fitting the model to the 
read count data. 

DSS (31) is similar to baySeq. A negative binomial dis- 
tribution is fitted to the read count data using a prior on 
the overdispersion parameter. This prior is a log-normal 
distribution, whose parameters are optimized using the 
dispersion parameters of each condition. Finally, a Wald 
test is used to determine differential expression. 

In summary, the class B methods, which use negative 
binomial distributions, i.e. DESeq, baySeq, DSS and 
edgeR, mainly differ in the way they estimate the 
overdispersion parameter. Estimating the overdispersion 
parameter is crucial for the performance and not trivial 
because the maximum likelihood estimator is biased and 
has high variance if the sample size is small (33). The sub- 
sequent statistical test has a smaller effect on the results 
than the parameter estimates (23,31). 

Extensions to multiple known conditions 

McCarthy et al. (32) extended the R package edgeR to 
more than two conditions. A generalized linear model is 
fitted to the data, and then coefficients are tested for being 
different from zero, which leads to the final ^-values. 
Again, the estimation of the overdispersion parameter 
for a transcript borrows information from other tran- 
scripts. DESeq, baySeq and SAMSeq have also been 
extended to more than two conditions. 



MATERIALS AND METHODS 

Method overview 

Our goal is to identify differentially expressed transcripts 
in studies with unknown conditions. A transcript is differ- 
entially expressed if the mean expression levels for differ- 
ent conditions are different and read counts are observed 
under more than one condition. Therefore we assume a 
small number of conditions because, as mentioned above, 
the detection of differential expression for many condi- 
tions is difficult. RNA-Seq expression data are usually 
represented as read counts per transcript, or alternatively 
by exon or gene. It was observed that read counts from a 
single condition follow a negative binomial distribution 
(24—26,31). DEXUS therefore models read counts as a 
finite mixture of negative binomial distributions. 

The model that best explains the observed read counts is 
selected from a set of models. In a Bayesian framework, 



model selection is based on finding the parameter which 
maximizes the posterior, the maximum a posteriori (MAP) 
parameter. The MAP model is found by an expectation 
maximization (EM) algorithm, where E-step and M-step 
are alternated repeatedly. The E-step estimates the 
unknown conditions based on actual model parameters, 
and the M-step optimizes the model parameters based on 
the E-step estimates. Models that use only one condition 
to explain the read counts are preferred by means of a 
prior distribution. One condition is the null hypothesis, 
which is rejected only if the data show strong evidence 
for more than one condition. Therefore, the parameters 
of the prior distribution determine how much DEXUS 
prefers to select models that explain the data without dif- 
ferential expression. Consequently, via the prior param- 
eters, DEXUS can be adjusted to have a low false 
discovery rate at the detection of differential expression. 

In the following subsections, we first describe the model 
in more detail and then explain the EM algorithm for 
model selection. Model selection includes prior assump- 
tions that lower the false discovery rate and lead to 
more accurate estimates. Finally, we show how to call 
differentially expressed transcripts on the basis of an in- 
fo rmative/noninformative (I/NI) value. 

The model 

Read count x per transcript is explained by a mixture of 
n negative binomial distributions: 

p(x) = ^aiNB^ift/,-) (1) 

where a,- is the probability of being in condition i out of n 
possible conditions. In condition i, read counts are drawn 
from a negative binomial distribution with mean /z,- and 
size r h where the size parameter r, is the inverse of the 
overdispersion Note that we use the (fi,r) instead of 
the usual (/u.,0) parametrization to locally accumulate par- 
ameters that are associated with large overdispersions. 
This accumulation is essential to define a prior within a 
Bayesian framework. 

A nondegenerate DEXUS model is identifiable (see 
Supplementary Section S3. 1.3), as required for the 
maximum likelihood and the maximum a posterior esti- 
mator to be consistent. Consistency means that the esti- 
mator converges to the true parameter values with more 
data points, which is important for identifying differential 
expression. If the mean read count exceeds the variance, 
the maximum likelihood estimate of r tends to oo and the 
negative binomial converges to a Poisson distribution (see 
Supplementary Section S3. 2. 2). 

Model selection 

We perform model selection in a Bayesian framework by 
maximizing the posterior, i.e. by a MAP approach 
(35-37). Therefore, the parameters a = {ct\, . . . ,a„), 
H — {lx\,...,fi„) and r = (n, . . . ,r n ) are considered to 
be random variables, and the likelihood p(x) in 
Equation (1) becomes the conditional probability 
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p{x\<x,fi,r). The objective of the model selection is to 
maximize the posterior of the parameters: 



p(fi,r,x\x) 



p(x\fi,r,x) p([i) p(ct) p(r) 
f p(x\(i,r,x) p(fi) p(x) p(r) da. dr dfi 

— - p(x\n,r,<x) pin) />(«) p(r) 
c(x) 



(2) 



where the priors on ot, fi and r are assumed to be inde- 
pendent of each other, and are defined in the following. 

Dirichlet prior for probabilities of conditions 

First we choose the prior p(x) on the probabilities of the 
conditions. Since the majority of transcripts in a data set 
are usually not differentially expressed, the model should 
favor explaining the read counts for a transcript with a 
single condition. The null hypothesis of one condition 
should only be rejected if the data contain strong 
evidence for more than one condition. The prior reduces 
the number of falsely discovered differentially expressed 
transcripts and therefore keeps the false discovery rate 
low. DEXUS uses a Dirichlet prior p(x) on « with param- 
eters y to incorporate the preference for only one 
condition: 



/>(«) = Ki) n 



(3) 



/=1 



where a is an «-dimensional probability vector. Each com- 
ponent a, is distributed according to a beta distribution 
with modeO/) = (/,■ - l)/(£''=i Yi ~ «)■ 

To express the prior knowledge that most transcripts 
are not differentially expressed and are generated under 
only one condition, we set y\ ^> y, (for i > 1). This setting 
assumes that most read counts are generated under con- 
dition i = 1, which we call the major condition, while con- 
ditions i > 1 are called minor conditions. The vector of 
hyperparameters {y\,y%, . . . ,y n ) can be simplified to one 
hyperparameter G (Supplementary Section S3. 2.1). In 
Supplementary Section S3. 4 we show that DEXUS is 
not sensitive to the choice of the hyperparameter G. 
Therefore DEXUS is easy to use as good results are 
obtained with the default setting of G = 1 (see 
Supplementary Section S3. 4). Without having seen the 
data, we assume that only the major condition is 
present, which means that the transcript is not differen- 
tially expressed. Only when the data show strong evidence 
also for minor conditions, does the posterior assign 
nonzero probabilities to minor conditions and the tran- 
script is called differentially expressed. 

Truncated exponential priors for overdispersions 
In DEXUS model selection, the second prior is on the size 
parameter r of the negative binomial distribution, which 
determines the overdispersion. A prior on /• improves the 
estimation of r if the number of samples is small. The 
maximum likelihood estimator of r is biased for few 
samples and overestimates the true size parameter 
(38,39), as confirmed in Supplementary Section S3. 2. 5. 
In a Bayesian approach, the influence of the prior de- 
creases with an increasing number of samples, and 



therefore the MAP estimator is asymptotically (number 
of samples tending to infinity) unbiased. 

To keep the estimate of r small, the prior pushes r 
toward zero. We choose an exponential distribution as 
prior: 



Pir) 



rj e 



(4) 



where r\ is a hyperparameter. 

Like DESeq (24), we truncate the size parameter at the 
right-hand tail by using the constraint r < r max . The upper 
bound r max on the size parameter is equivalent to a lower 
bound on the overdispersion and ensures minimal 
overdispersion for the read counts of each transcript. 
Further, this bound makes the parameter space 
compact, which is required for a consistent estimator. 
The same exponential prior is used for each component 
of r. The hyperparameter i"| for the exponential prior on r 
is transformed to a hyperparameter 9 (see Supplementary 
Section S3. 2. 5). Like the hyperparameter G, also 6 is 
robust and supplies good results with 6 = 2.5. 

Uniform priors for means 

Finally, DEXUS model selection uses a prior on the mean 
u of the negative binomial distribution. If in one condition 
all read counts were close to zero (transcripts are not 
present), the estimate of the mean of the negative 
binomial would not converge. Therefore, /x,- is lower 
bounded by /x m j n . To ensure a compact parameter space 
as required for a consistent estimator, we use a uniform 
prior on /x, on the compact interval [/x m i n ,ii ma x], where 
/Lt max can be set to the largest observed read count. 

In summary, DEXUS has only few parameters which in 
most applications need not be adjusted by the user, as 
their default values give good results. 

EM algorithm 

With the priors defined, the model with maximum param- 
eter posterior can be selected. The EM algorithm (40) is 
used to minimize an upper bound on the negative log-pos- 
terior of the parameters. The E-step of the EM algorithm 
estimates the probability that a read count is generated 
under a particular condition. The M-step optimizes the 
model parameters based on the E-step estimates. 

In the DEXUS model, a,- is the probability of condition 
i without observing any data. The model posterior 6i& es- 
timates the probability that read count x k is generated 
under condition i (the probability of condition after 
observing data x k ): 



01 ik = 



a,- NBfa ; /x,-,r,) 

n 

J2°<i NB(i> ; [x,i,ri) 

;=1 



(5) 



This equation is the E-step (expectation step) of the EM 
algorithm. Using the posterior estimates a,/ f , we obtain 
following update rules for the M-step (maximization step): 



• estimate for a,-: 



1 

N 



N 



(6) 
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fi update: 



fit 



N 

E 

k=\ 



ji E «ft x k 



Of; 



(7) 



• r update: 

The new >■,- is obtained by solving the following 
equation for r{. 



(8) 



/> = 1 



N a/ log 



Ti a,- 



i E "ft X A- + «i 



where xjr is the digamma function. The equation is solved 
numerically for r, by means of a 'bisection 1 procedure. 

• oe update: 



&i + is in - 1) 



En 



(9) 



The complete derivation of the EM algorithm can be 
found in the Supplementary Section S3.2.1. 

fj,i and Yi are initialized by using the results of k-means 
clustering (see Supplementary Section S3. 2. 4). The values 
a, are simply initialized with the maximum entropy setting 
0£i = l/n. 

I/NI value: evidence of differential expression 

The Bayesian framework allows definition of an I/NI call 
(36,37,41,42). The I/NI call serves to extract differentially 
expressed transcripts with a desired specificity (1 — signifi- 
cance level or 1 — type I error rate) or sensitivity (power or 
1 - type II error rate). DEXUS first computes the I/NI 
value, which quantifies the contribution of differential ex- 
pression to the read count variation. Transcripts are then 
called informative if the I/NI value exceeds a threshold. 

Unlike </>,- or r h which capture noise variation, a captures 
variation arising from differentially expressed transcripts. 
The posterior ot of a indicates differential expression in the 
data in the form of minor conditions with probabilities 
larger than zero. The larger the posterior value a,- of a 
minor condition i > 1, the stronger the evidence for its 
presence. Further, evidence is also required that the 
minor condition is different from the major condition in 
terms of mean read counts. Although identifiability of the 
DEXUS model ensures that the negative binomials of dif- 
ferent conditions are different, they may still be close to 
one another. The more the mean of the minor condition 
i > 1 differs from the mean of the major condition, the 
stronger is the evidence that the minor condition is differ- 
ent from the major condition. In conclusion, evaluating 
the evidence of differential expression (the I/NI value) 
should consider two factors: (i) a,- as the evidence for the 



presence of the minor condition i > 1; (ii) the difference 
between the means of the major and minor conditions as 
evidence that they are indeed different. 

The difference between the means is expressed by 
the log difference |log(/x,-) — log(/xi)|. Factor (I) is 
incorporated into the I/NI value by weighting these dif- 
ferences by &i, which yields 



n / 

i/m(%ii) = J2&i io g p 

n 

= ^a, |log(/x,) - log(/xj)| . 



(10) 



The I/NI value is the expected log fold change of read 
counts with respect to the mean read count of the major 
condition given a noise-free model. 'Noise-free' refers to 
the assumption that under each condition, only the mean 
read count is observed. For a mathematical interpretation 
of the I/NI value see Supplementary Section S3. 3. 2. 

Experiments 

We evaluated DEXUS on simulated and real-world data 
sets. The simulated data sets had various library sizes, 
different numbers of replicates and different ratios 
between mean read counts under the different conditions. 
DEXUS was tested on the following real-world RNA-Seq 
data sets: (i) 'Nigerian HapMap', based on 69 Nigerian 
HapMap individuals, (ii) 'European HapMap', based on 
60 European HapMap individuals, (iii) 'Primate Liver', 
based on liver tissue samples from humans, chimpanzees 
and rhesus macaques, (iv) 'Maize Leaves', using samples 
from different locations of maize plant leaves, and (v) 
'Mice Strains', based on different strains of mice 
(Supplementary Section S4.2.4). 

First we report the performance of DEXUS on 2400 
simulated data sets for which the conditions were known 
but withheld from DEXUS. We then present tests on real- 
world data sets with either unknown conditions ('Nigerian 
HapMap', 'European HapMap') or partially known con- 
ditions ('Primate Liver', 'Maize Leaves'). In the latter case 
the conditions were withheld from DEXUS to ascertain 
whether it can identify them. 

DEXUS for known conditions 

Before we tested DEXUS on data with unknown condi- 
tions, we assessed how well it performs if the conditions of 
interest are known. For known conditions, DEXUS esti- 
mates only the parameters of a negative binomial for each 
condition. Therefore, we compared the parameter esti- 
mates of DEXUS to previously suggested methods in 
terms of detecting differentially expressed transcripts, 
namely the following eight state-of-the-art methods: DSS 
(31), DESeq (24), baySeq (26), edgeR (25), DEGseq (28), 
NOISeq (30), PoissonSeq (29) and SAMseq (27). 

If only few samples per condition are available, the per- 
formance of DEXUS is below the best performing other 
methods. For medium and large sample numbers and 
small library size (le6) DEXUS is second and third best 
method. For medium and large sample numbers and large 
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library sizes (le7 and le8) DEXUS outperforms all other 
methods. The experiments and the respective results are 
described in detail in Supplementary Section S4.2. 

Simulated RNA-Seq data 

Generating simulated RNA-Seq data 

We simulated data sets from different experimental 
settings following the suggestions of Robinson et al. 
(34), Hardcaste and Kelly (26) and Wu et al. (31). For 
all samples of a data set, the library size was le6, le7 or 
le8 to cover a wide range of applications. Keeping the 
library size and the read quality constant for each 
sample in a data set avoids the need for normalization 
of the read counts, i.e. it avoids normalization biases. 
For each experiment, we choose a particular number of 
replicates per condition to evaluate DEXUS for different 
sample sizes and for unbalanced data. In case of two con- 
ditions, the numbers of replicates were 6/6, 9/3, 10/2, 11/1, 
12/12, 18/6, 20/4, 22/2 (condition l/condition2). Each ex- 
periment consisted of 100 data sets with 10000 transcripts 
each. The conditions were known and used for evaluation 
but withheld from DEXUS. 

For the simulation we assumed that under condition i 
the reads x for a transcript are distributed according to a 
negative binomial NB(.r; /x,/,). After Wu et al. (31), we 
took the mean /x,- and the size from the 'Mice Strains' 
benchmark RNA-Seq data set (43) using only data from 
one particular biological condition. For a randomly 
selected transcript, the value /x,- was obtained as the 
median read count of the condition. 

The overdispersion </>,■ = 1 /r t tends to decrease with 
increasing mean read counts (see Supplementary Figure 
SI 5). Therefore, we fitted a regression line to 
overdispersion values by least squares. After sampling it,- 
values, the corresponding 0, values were obtained by 
means of the regression line to which zero-one normally 
distributed noise was added. Thirty percent of the genes 
were chosen to be differentially expressed. Differential ex- 
pression was modeled by adjusting the means of the 
negative binomials to obtain log fold changes of 0.5, 1 
and 1.5 between the mean of the major and the minor 
condition. The fold change values are randomly chosen 
with equal probability, such that all 3-fold change 
categories have about the same number of genes in each 
data set. 

Evaluation criteria for simulated RNA-Seq data 

We formulate the detection of differential expression as a 
classification task: DEXUS must decide whether a tran- 
script is differentially expressed (positive prediction) or 
not (negative prediction). For the simulated data, we 
knew which transcripts were differentially expressed (the 
positives) and which were not (the negatives). DEXUS 
ranks the transcripts by the I/NI value from Equation 
(10). For a given I/NI threshold (the I/NI call), we can 
determine true positives, false positives, true negatives and 
false negatives. Using these numbers, we computed the 
specificity and the sensitivity of DEXUS. The specificity 
corresponds to T — significance level' or T — type I error 
rate'. The type I error rate is the ratio between false 



detections and all negatives. The sensitivity corresponds 
to the 'power' or T— type II error rate'. The type II 
error rate is the ratio between missed positives and all 
positives. 

Results on simulated RNA-Seq data 

We tested DEXUS on the simulated RNA-Seq data using 
its default parameters. Table 2 shows the results in terms 
of sensitivity and specificity for library size le8 at different 
thresholds for the I/NI value. Transcripts with an I/NI 
value above the threshold are called informative or 
(equivalently) differentially expressed. Results for other 
library sizes are presented in Supplementary Tables S12 
and SI 3. The specificity of DEXUS is high across various 
numbers of replicates, whereas the sensitivity varies con- 
siderably. High specificity means that few transcripts are 
falsely identified as being differentially expressed. In 
highly unbalanced experiments, i.e. 11/1 and 22/2 repli- 
cates, differentially expressed transcripts are detected 
only at low I/NI thresholds of 0.025 and 0.05. Note that 
the minor condition i = 2 (smaller subgroup) leads to a 
small a 2 and therefore to a small I/NI value. For 
unbalanced data, the few minor condition samples must 
be distinguished from random outliers of the major 
condition. 

Real-world RNA-Seq data 
'Nigerian HapMap' 

Pickrell et al. (16) sequenced RNA from 69 Nigerian 
HapMap individuals to study expression quantitative 
trait loci (eQTLs). The read count data were provided 
by the ReCount repository (44). As in previous experi- 
ments, DEXUS was applied to these data with its 
default parameters and ranked genes according to the 
I/NI value. The read counts of top-ranked genes and the 
conditions identified by DEXUS are visualized as a 
heatmap in Figure 1. 

Five out of the 12 top-ranked genes are located on the Y 
chromosome (RPS4Y1, CYorfl5A, EIF1AY, TMSB4Y, 
RPS4Y2). For these genes, the identified conditions are 
related to the sex. For four of the 12 top-ranked genes, 
at least one eQTL is known. For ZFP57, the associated 
eQTL is the SNP rs 1736924 with a minor allele frequency 
(MAF) of 0.14 (16). CDH1 has 6 eQTLs, one of which is 
SNP rs7196495 with a MAF of 0.22 (45). CLLUIOS 
possesses the eQTL SNP rsl2580153 with a MAF of 
0.19 (46). L1TD1 has 2 eQTLs, one of which is SNP 
rsl2137088 with a MAF 0.30 (47). Because the MAFs 
are high, it is plausible that the minor alleles are 
observed in the HapMap data set and that they lead to 
differential expressions of the associated genes. The con- 
ditions that were found by DEXUS are related to the 
alleles of corresponding SNPs. 

Because the HapMap samples are lymphoblastoid cells, 
we confirmed that the genes detected by DEXUS are 
indeed expressed in lymphoblastoid cell lines. The gene 
NLRP2, ranked 11th by DEXUS, is expressed in 
lymphoblastoid cells but with large variability (48), as 
shown in Supplementary Figure SI 7. NLRP2 is expressed 
in the HapMap individuals, but in some at a low level. 



Page 7 of 1 1 Nucleic Acids Research, 2013, Vol. 41, No. 21 el98 



Table 2. The performance of DEXUS in terms of sensitivity and specificity in detecting differential expression with unknown conditions 



I/NI threshold 
C1/C2 




0.025 




0.05 


0. 1 
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Spiici 1 1 vi i\r 
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ijCwint*i l y 


ijCllal Ll VI L y 


6/6 


0.893 ± 0.003 


0.775 ± 0.009 


0.951 ± 0.002 


0.720 ± 0.009 


0.985 ± 0.002 


0.646 ± 0.009 


9/3 


0.893 ± 0.004 


0.827 ± 0.006 


0.951 ± 0.002 


0.766 ± 0.007 


0.985 ± 0.001 


0.580 ± 0.008 


10/2 


0.893 ± 0.003 


0.819 ± 0.008 


0.950 ± 0.002 


0.656 ± 0.009 


0.985 ± 0.001 


0.325 ± 0.009 


11/1 


0.893 ± 0.003 


0.677 ± 0.009 


0.951 ± 0.002 


0.351 ± 0.008 


0.985 ± 0.001 


0.020 ± 0.003 


12/12 


0.945 ± 0.002 


0.735 ± 0.008 


0.982 ± 0.001 


0.665 ± 0.008 


0.996 ± 0.001 


0.610 ± 0.009 


18/6 


0.945 ± 0.003 


0.816 ± 0.008 


0.982 ± 0.002 


0.743 ± 0.009 


0.996 ± 0.001 


0.570 ± 0.011 


20/4 


0.945 ± 0.003 


0.810 ± 0.008 


0.982 ± 0.002 


0.625 ± 0.009 


0.996 ± 0.001 


0.308 ± 0.009 


22/2 


0.946 ± 0.002 


0.650 ± 0.009 


0.982 ± 0.001 


0.325 ± 0.008 


0.996 ± 0.001 


0.006 ± 0.002 


Mean 


0.919 ± 0.028 


0.764 ± 0.069 


0.966 ± 0.017 


0.606 ± 0.172 


0.991 ± 0.006 


0.383 ± 0.261 



The first column 'C1/C2' contains the numbers of replicates for the first and second condition. The other columns list sensitivity and specificity (with 
standard deviations) of DEXUS at different I/NI thresholds as the average across 100 data sets. The last row ('Mean') gives the average of the results 
in the columns. The library size was le8 for all experiments. 
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Figure 1. Heatmap of the normalized read counts of the 12 genes with the largest I/NI values for the 'Nigerian HapMap' data set. Colors range 
from white for low expression to blue for high expression. Different individuals are denoted along the x-axis, while the top-ranked genes are denoted 
by their gene symbols along the y-axis. Red crosses indicate samples that belong to the minor condition. At the right side of the heatmap, each gene 
is annotated by the minimum ('>'), the median of two conditions ('ml' and 'm2') and the maximum ('<') read count. 



Schlattl et al. (49) identified a copy number variable region 
that partially covers NLRP2 and causes its differential 
expression. Therefore, the conditions that DEXUS 
identified for NLRP2 may be related to copy number 
states of the samples. Copy number states may also be 
responsible for differential expression of MKRN3, which 
was ranked 12th by DEXUS. Pinto et al. (50) and Redon 
et al. (51) identified a copy number variable region 
covering MKRN3. However, interpreting the MKRN3 
conditions is difficult because only the paternal copy of 
MKRN3 is expressed. 

We analyzed the I/NI value ranking of transcripts: 
genes on the X chromosome were ranked significantly 
higher than other genes (P = 3.0e— 12), which can be ex- 
plained by sex-related transcripts. An analogous test for 
the Y chromosome was not significant because too few 
genes were expressed. However, as already mentioned, 
five out of the 12 top-ranked genes are located on the 



Y chromosome. At an I/NI threshold of 0.1, DEXUS 
called 366 differentially expressed genes. Gene set enrich- 
ment analysis showed that the called genes are associated 
with the extracellular region and the plasma membrane. In 
total, 20 significant GO terms were found, including 
'extracellular space', 'extracellular region part' and 
'plasma membrane part' with P = 2.5e— 5, P = 8.8e— 5 
and P = 0.01, respectively. 'Cell-cell signaling', 
'chemokine receptor binding' and 'chemokine activity' 
were also significant at P = 4.0e— 3, P = 8.0e— 4 and 
P = 9.8e— 4 (P-values were corrected for multiple testing 
by means of the Benjamini-Hochberg procedure). These 
GO terms are in agreement with characteristics of 
lymphoblastoid cells. Supplementary Table S18 provides 
a complete list of all significant GO terms. 

'European HapMap' 

We analyzed the RNA-Seq data of 60 individuals from the 
HapMap cohort from Montogmery et al. (15), which were 
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Figure 2. Heatmap of the normalized read counts of the 12 genes with the largest I/NI values for the 'European HapMap' data set. Colors range 
from white for low expression to blue for high expression. Different individuals are denoted along the x-axis, while the top-ranked genes are denoted 
by their gene symbols along the y-axis. Red crosses indicate samples that belong to the minor condition. At the right hand side of the heatmap, each 
gene is annotated by the minimum (">'), the median of two conditions ('ml' and 'm2') and the maximum ('<') read count. 



provided by the ReCount repository (44). Again, DEXUS 
was applied to these data with its default parameters and 
ranked genes according to the I/NI value. The read counts 
of top-ranked genes and the identified conditions are 
visualized as a heatmap in Figure 2. 

RPS4Y1 is the gene with the largest I/NI value, differ- 
entially expressed between males and females, and located 
on the Y chromosome. The genes CYorfl5A and 
TMSB4Y, ranked fourth and fifth according to the I/NI 
value, are also located on the Y chromosome. As in the 
'Nigerian HapMap' data set, ZFP57 was detected as being 
differentially expressed. In addition to ZFP57, two other 
of the 12 top-ranked genes have eQTLs. CLLUIOS has as 
eQTL the SNP rsl2580153 with a MAF of 0.19 (46). 
POU2F3 has as eQTL the SNP rs2847497 with a MAF 
of 0.14 (52). As in the 'Nigerian HapMap' data set, some 
top-ranked genes, such as NLRP2 (again rank 11), were 
differentially expressed owing to variable copy numbers 
(49). For the genes T, PRSS21 and RASSF10, DEXUS 
identified two conditions for which an explanation 
remains to be found and which may indicate a hitherto 
unknown source of variability in gene expression. The 
second-ranked gene T, the third-ranked gene PRSS21 
and the 12th- ranked gene RASSF10 are known to be ex- 
pressed in B-lymphoblastoid cells (6,12), the cell type of 
the HapMap samples. The high expression variability of T 
and PRSS21 in the B-lymphoblastoid cell line has already 
been reported by the ENCODE Project (6). The 
ENCODE Project expression values for the genes T, 
PRSS21 and RASSF10 are visualized in Supplementary 
Figures S19, S20 and S21. 

Analyzing the I/NI value ranking, we found that genes 
on the X chromosome are ranked significantly higher 
(P = 8.0e— 6, Wilcoxon test). The analogous test for the 
Y chromosome yielded no significant results, as too 
few genes were expressed. However, three out of the 12 



top-ranked genes with the largest I/NI value are located 
on the Y chromosome. 

At an I/NI threshold of 0.1, DEXUS called 680 differ- 
entially expressed genes. Gene set enrichment analysis 
showed that the called genes are associated with ion trans- 
port. Significant Gene Ontology (GO) terms were 'ion 
transport', 'potassium ion transport' with P = 0.04 and 
P = 4.3e— 03, respectively. Again 'plasma membrane 
part' was significant at P = 0.027. Although 36 of the 
680 genes were related to 'cell-cell signaling' and 6 to 
'chemokine activity', these GO terms were not significant 
in this data set after correction for multiple testing by 
means of the Benjamini-Hochberg procedure. A table of 
all significant GO terms can be found in Supplementary 
Table S19. 

'Primate Liver' 

Blekhman et al. (53) investigated the differences in alter- 
native splicing in liver tissue between humans, chimpan- 
zees and rhesus macaques. For this purpose, they 
performed RNA-Seq on three male and three female 
liver samples from each species. They focused on the ex- 
pression values of exons that had reliably determined 
orthologs in all species. Read counts for exons were 
provided by Blekhman et al. (53), who used gene models 
from Ensemble (Release 50). After pooling technical rep- 
licates, DEXUS ranked genes according to the I/NI value 
using its default parameters. The 10 top-ranked genes are 
visualized in Figure 3, which shows clear differential ex- 
pression between the species. For these genes, and without 
having been provided with this information, DEXUS 
determined one of the three species as minor condition. 
Interestingly, out of the 10 top-ranked genes, six are 
human pseudogenes, AC010591.10, AC105383.3, 
AC093874.3-1, AC105383.3, AL132855.4 and UOX, 
which are inactive in humans because of recent structural 
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Figure 3. Heatmap of the normalized read counts of the 10 genes with 
the largest I/NI values for the 'Primate Liver' data set. Colors range 
from white for low expression to blue for high expression. The x-axis 
shows female and male individuals from the three species human Homo 
sapiens (HS), chimpanzee P. troglodytes (PT) and rhesus macaques 
M. mulatto (MM). The y-axis displays top-ranked genes indicated by 
their gene symbols. Red crosses mark samples that were assigned to the 
minor condition. At the right side of the heatmap, each gene is 
annotated by the minimum ('>'), the median of two conditions ('ml' 
and 'm2') and the maximum ('<') read count. 



rearrangements (54). Because the rearrangements are 
recent, their orthologs can be identified reliably in other 
primates. Differential expression is detected because these 
orthologs are still transcribed in Pan troglodytes and in 
Macaco mulatta. 

Several of the 10 top-ranked genes are associated with 
liver pathways and are therefore expressed in liver tissues. 
Differential expression of these genes between species may 
have arisen from different diets. Examples of such genes 
are the human pseudogene UOX, which catalyze the oxi- 
dation of uric acid to allantoin in M. mulatta, ABP1 and 
GSTM5, which participate in degradation and detoxifica- 
tion pathways, VNN3, which helps to recycle vitamin B5, 
and CHFR2, which is associated with lipoproteins. 

Thresholding the I/NI call at 0.1, DEXUS called 3384 
genes (16% of all genes) as differentially expressed. A gene 
set enrichment analysis found the GO terms 'intrinsic to 
plasma membrane' (P = 7.9e— 7) and 'integral to plasma 
membrane' (P = 4.0e— 6) to be significant. Thus, genes 
that encode membrane proteins seem to be differentially 
expressed between species more often than other genes. 
Interestingly, also 'response to extracellular stimulus', 
'response to nutrient' and 'response to nutrient levels' 
were significant (all /"-values <7.6e— 5), which supports 
the hypothesis that some genes are differentially expressed 
owing to the different diets of the species. All /"-values were 
corrected by means of the Benjamini-Hochberg procedure. 
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Figure 4. Heatmap of the normalized read counts of the 10 genes with 
the largest DEXUS I/NI values for the 'Maize Leaves' data set. Colors 
range from white for low expression to blue for high expression. The 
x-axis shows samples from different locations on the maize plant leaf. 
The y-axis displays different genes denoted by their gene symbols. Red 
crosses indicate that the according samples belong to the minor condi- 
tion. At the right hand side of the heatmap, each gene is annotated by 
the minimum ('>'), the median of two conditions ('ml' and 'm2') and 
the maximum ('<') read count. 



'Maize Leaves' 

Using RNA-Seq data from various locations on maize 
plant leaves, Li et al. (55) studied the developmental 
dynamics of the maize transcriptome. For each location, 
two biological replicates were sequenced with Illumina's 
Genome Analyzer II. The reads were mapped to the TE- 
masked Zea maize ZmB73 reference genome version 2 
(AGPv2), release 5a, using the GSNAP splicing short 
read mapper (56). We counted the overlaps between 
mapped reads and the Z. maize gene definitions from 
the Ensemble Plants database (Release 14). Reads that 
have multiple possible alignments or that overlap with 
more than one gene were discarded. DEXUS was 
applied to this data with its default parameters. 

Figure 4 shows the genes with the largest I/NI value and 
the conditions that were identified by DEXUS. DEXUS 
found differences in gene expressions between different 
locations on the leaf despite this information being 
withheld. Further, it almost always assigned the two rep- 
licates to the same condition without knowledge of repli- 
cates or leaf locations. Thus, DEXUS assigns conditions 
reliably. 

Eight of the 10 top-ranked genes were also measured by 
means of microarrays across different leaf locations of 
Z. mays (57). In this microarray experiment, all eight 
genes show an absolute log fold change of at least 1 
between base and tip. Six of these eight genes show an 
absolute log fold change greater than four. 

The two remaining genes, GRMZM2G331518 and 
AC213612.3_FG001, were not annotated on the 
microarray. The function of the top-ranked gene 
GRMZM2G331518 is not known. However, the 
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associated peptide is similar to the defensin-like protein 91 
of Arabidopsis thaliana, which plays a role in immune 
response. The gene ranked ninth, AC213612.3_FG001, is 
a glycine-rich cell wall structural protein, which is com- 
patible with cell walls at different locations having differ- 
ent structure. 

At a threshold of 0.1 for the I/NI call, DEXUS called 
15 756 differentially expressed genes. Gene set enrichment 
analysis using the R package goseq (58) yielded to the 
significant GO terms 'chloroplast' (P = 1.8e— 92) and 
'plasma membrane' (P = 1.3e— 34). Further, the GO 
terms 'cytosolic ribosome' (P = 9.8e— 32), 'chloroplast 
thylakoid membrane' (P = 5.4e— 31) and 'chloroplast 
stroma' (P = 1.8e— 30) were significant. All P-values 
were corrected by means of the Benjamini-Hochberg pro- 
cedure. It is plausible that that chloroplast also differs at 
different locations on the maize plant leaf. The GO term 
'cell wall' was highly significant (P = 3.9e— 18), which 
supports the above-mentioned hypothesis that the cell 
walls differ at different locations on the plant leaf. 



CONCLUSION 

We have introduced DEXUS, an algorithm that identifies 
differentially expressed transcripts in RNA-Seq data with 
unknown conditions. DEXUS is appropriate for use with 
data from cohort, cross-sectional and nonrandomized 
controlled studies, where conditions are often unknown. 
In experiments with simulated and real-world data with 
known conditions, DEXUS successfully found differential 
expressed transcripts and conditions, although the condi- 
tions were withheld from DEXUS. For HapMap individ- 
uals, DEXUS detected differentially expressed transcripts, 
the vast majority of which are related to sex, eQTLs or 
structural variants. We envisage that DEXUS will evolve 
into an important tool for analyzing RNA-Seq data in 
studies with unknown conditions and thus for obtaining 
new biological and medical knowledge. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 
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